Outlier Detection in Univariate and Multivariate Analysis

Breastfeeding

By Thuc Dao (Thomas Dao)

Import libraries

Load data

Transform data

Data distribution

For skewness: If the number is greater than +1 or lower than –1, this is an indication of a substantially skewed distribution.

For kurtosis: If the number is greater than +1, the distribution is too peaked. If the number is less than –1, the distribution is too flat.
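A minimal sketch of how these two rules of thumb can be checked with pandas. The data is synthetic and the column names are hypothetical, not the notebook's dataset. Note that pandas `kurt()` reports excess kurtosis, for which a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): one symmetric and one right-skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(size=1000),
})

# Per-column skewness and excess kurtosis.
summary = pd.DataFrame({"skewness": df.skew(), "kurtosis": df.kurt()})

# Apply the +/-1 rules of thumb described above.
summary["substantially_skewed"] = summary["skewness"].abs() > 1
summary["too_peaked"] = summary["kurtosis"] > 1
summary["too_flat"] = summary["kurtosis"] < -1

print(summary)
```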

1. Univariate analysis

1.1. Detecting outliers using interquartile range
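A minimal sketch of the interquartile-range rule, assuming the conventional 1.5 × IQR fences. The multiplier `k`, the helper name `iqr_outliers`, and the sample values are illustrative assumptions, not taken from the notebook's data.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample with one obviously extreme value.
sample = np.array([10, 12, 11, 13, 12, 11, 10, 14, 12, 50])
mask = iqr_outliers(sample)
print(sample[mask])  # values flagged as outliers
```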

1.2. Detecting outliers using z-score

The z-score, also called the standard score, tells how many standard deviations a data point lies from the mean.

If the z-score of a data point is greater than 3 or less than -3, the data point may be an outlier.
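A minimal sketch of the z-score rule with NumPy. The helper name `zscore_outliers` and the sample values are hypothetical. The threshold of 3 follows the text above.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (mask, z) where mask is True for |z| greater than the threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), NumPy's default.
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold, z

# Hypothetical sample: 30 ordinary values and one extreme value.
values = np.array([10.0] * 15 + [12.0] * 15 + [40.0])
mask, z = zscore_outliers(values)
print(values[mask])  # values flagged as possible outliers
```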

2. Multivariate analysis

2.1. Detecting outliers using DBSCAN algorithm

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a non-parametric, density-based clustering algorithm.

Mechanism: the algorithm selects a point that has not yet been assigned to a cluster or designated as an outlier, then determines whether it is a core point by checking whether at least a given minimum number of samples lie within a given distance (epsilon). If so, it is designated a core point and forms a cluster together with all points within direct reach of it. This process is repeated for the newly reached points until the edge of the cluster is found, where no more points lie within the epsilon distance of the cluster.

If a point does not fall within any of the potential clusters then it is deemed an outlier.

The benefit of this method is that it is unsupervised and can be used when no assumption can be made about the distribution of values in the feature space.
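A minimal sketch of DBSCAN-based outlier detection with scikit-learn. The data is synthetic, and the `eps` and `min_samples` values are illustrative assumptions that would need tuning for real data. Points that DBSCAN cannot attach to any cluster receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (assumption): two dense blobs plus two isolated points.
rng = np.random.default_rng(42)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(100, 2))
far_points = np.array([[10.0, -10.0], [-10.0, 10.0]])
X = np.vstack([cluster_a, cluster_b, far_points])

# eps is the neighbourhood radius; min_samples is the minimum number of
# points within that radius for a point to count as a core point.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN labels noise points (outliers) as -1; clusters are numbered from 0.
outlier_mask = labels == -1
print("clusters found:", len(set(labels) - {-1}))
print("outliers found:", int(outlier_mask.sum()))
```

In practice the features are often scaled first, because `eps` is a single distance applied across all dimensions.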

2.2. Detecting outliers using Local Outlier Factor algorithm

The Local Outlier Factor algorithm resembles the DBSCAN algorithm in that it examines the neighbors of a point, but it behaves a bit differently.

The Local Outlier Factor algorithm examines a point and its neighbors to estimate the point's local density and compares it with the densities of those neighbors. If the density of a point is much lower than that of its neighbors, the point is likely to be an outlier.

The key parameters of this algorithm are the number of neighbors to compare with and the metric used to calculate the density.

Default number of neighbors: 20 (this number should be greater if the proportion of outliers is more than 10%).

Default metric: Minkowski distance, which generalises both Euclidean distance and Manhattan distance.

The benefit of this algorithm is that it can take both the local and global properties of the dataset into account, as it focuses on how isolated a sample is with respect to its surrounding neighbourhood.
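A minimal sketch of Local Outlier Factor with scikit-learn, spelling out the defaults mentioned above (`n_neighbors=20`, Minkowski metric; `p=2` makes it Euclidean). The data is synthetic and serves only as an illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption): one Gaussian cloud plus two isolated points.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 2))
far_points = np.array([[8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([cloud, far_points])

lof = LocalOutlierFactor(n_neighbors=20, metric="minkowski", p=2)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier

# More negative values mean a lower density relative to the neighbours.
scores = lof.negative_outlier_factor_

print("outliers found:", int((labels == -1).sum()))
print("scores of the two isolated points:", scores[-2:])
```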

2.3. Detecting outliers using Isolation Forest algorithm

Unlike other algorithms, which focus first on the normal data and then identify anomalies, Isolation Forest focuses initially on identifying anomalies and then on the normal data.

For this algorithm, we first need to specify the contamination parameter, which is the proportion of the data expected to be anomalies.

Based on the results of the other algorithms, where 15 out of 195 points are outliers (about 7.7%), we set the contamination parameter to 0.07 (7%).
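A minimal sketch of Isolation Forest with scikit-learn using `contamination=0.07` as in the text. The 195 rows below are synthetic stand-ins, not the notebook's data, and `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): 190 ordinary rows plus 5 isolated rows.
rng = np.random.default_rng(0)
ordinary = rng.normal(size=(190, 2))
far_points = np.array(
    [[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]]
)
X = np.vstack([ordinary, far_points])

# contamination fixes the share of rows that will be labelled as anomalies.
iso = IsolationForest(contamination=0.07, random_state=0)
labels = iso.fit_predict(X)  # 1 = normal, -1 = anomaly

n_outliers = int((labels == -1).sum())
print("outliers found:", n_outliers)
```

Because `contamination` fixes the share of flagged rows, roughly 7% of the rows are labelled -1 regardless of how many are truly anomalous.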

2.4. Detecting outliers using Elliptic Envelope algorithm

The Elliptic Envelope algorithm assumes that the data follow a Gaussian distribution. It fits an imaginary ellipse around the dataset: values inside the ellipse are treated as normal data, and anything outside it is assumed to be an outlier.

For this algorithm, we first need to specify the contamination parameter, which is the proportion of the data expected to be anomalies.

Based on the results of the other algorithms, where 15 out of 195 points are outliers (about 7.7%), we set the contamination parameter to 0.07 (7%).
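A minimal sketch of Elliptic Envelope with scikit-learn using `contamination=0.07` as in the text. The 195 rows are synthetic and roughly Gaussian, which matches the assumption the algorithm makes; they are not the notebook's data.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic data (assumption): 190 Gaussian rows plus 5 isolated rows.
rng = np.random.default_rng(0)
ordinary = rng.normal(size=(190, 2))
far_points = np.array(
    [[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]]
)
X = np.vstack([ordinary, far_points])

# Fits a robust covariance estimate and flags the rows farthest from the centre.
envelope = EllipticEnvelope(contamination=0.07, random_state=0)
labels = envelope.fit_predict(X)  # 1 = inside the ellipse, -1 = outside

n_outliers = int((labels == -1).sum())
print("outliers found:", n_outliers)
```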

2.5. Detecting outliers using a combination of methods

Because the four methods above flag different points as outliers, we combine all of their results to decide the final outliers.

Explanation: the score of a point is the number of methods that mark it as normal minus the number of methods that mark it as an outlier.

4 = 4 - 0 : normal in all 4 methods, outlier in none

2 = 3 - 1 : normal in 3 methods, outlier in 1 method

0 = 2 - 2 : normal in 2 methods, outlier in 2 methods

-2 = 1 - 3 : normal in 1 method, outlier in 3 methods

-4 = 0 - 4 : normal in no method, outlier in all 4 methods

The decision is made by majority: a point marked as an outlier by 3 or more methods (score <= -2) is a final outlier.
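A minimal sketch of the voting scheme, assuming each method's result has been encoded as 1 (normal) or -1 (outlier). The votes and column names below are hypothetical, chosen to show one row for each possible score.

```python
import pandas as pd

# Hypothetical votes: 1 = normal, -1 = outlier, one column per method.
# DBSCAN returns cluster ids, so its labels would first need mapping,
# for example with np.where(labels == -1, -1, 1).
votes = pd.DataFrame({
    "dbscan": [1, 1, 1, 1, -1],
    "lof": [1, 1, 1, -1, -1],
    "isolation_forest": [1, 1, -1, -1, -1],
    "elliptic_envelope": [1, -1, -1, -1, -1],
})

# Score = (number of "normal" votes) - (number of "outlier" votes).
score = votes.sum(axis=1)
votes["score"] = score

# Majority rule: 3 or more outlier votes means score <= -2.
votes["final_outlier"] = votes["score"] <= -2

print(votes)
```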